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Abstract — Tree shape statistics quantify some aspect of the 
shape of a phylogenetic tree. They are commonly used to compare 
reconstructed trees to evolutionary models and to find evidence 
of tree reconstruction bias. Historically, to find a useful tree 
shape statistic, formulas have been invented by hand and then 
evaluated for utility. This article presents the first method which 
is capable of optimizing over a class of tree shape statistics, called 
Binary Recursive Tree Shape Statistics (BRTSS). After defining the 
BRTSS class, a set of algebraic expressions is defined which can 
be used in the recursions. The tree shape statistics definable using 
these expressions in the BRTSS is very general, and includes 
many of the statistics with which phylogenetic researchers are 
already familiar. We then present a practical genetic algorithm 
which is capable of performing optimization over BRTSS given 
any objective function. The chapter concludes with a successful 
application of the methods to find a new statistic which indicates 
a significant difference between two distributions on trees which 
were previously postulated to have similar properties. 

Index Terms — Biology and genetics, Evolutionary computing 
and genetic algorithms 

Tree shape statistics are numerical summaries of some 
aspect of the shape of a phylogenetic tree. The first tree shape 
statistic was the N of Sackin [1], where the explicit goal was to 
numerically describe the balance of a tree, which is the degree 
to which daughter subtrees of internal nodes are of similar or 
different size. Trees which are balanced have smaller N than 
do trees which are imbalanced. Many other tree shape statistics 
followed, all quantifying balance; a review of this literature can 
be found in the excellent review article by Mooers and Heard 
[2]. 

The next important step in tree shape theory was made 
by Kirkpatrick and Slatkin [3] who wondered which statistics 
were the most powerful to distinguish between the so-called 
ERM and PDA distributions on trees. The statistics which 
they chose to rate included most of the statistics available 
in the literature at that time. Their article is among the 
most influential in the area of tree shape, with over 60 
citations as of March 2006 (ISI Web of Knowledge search, 
http://portal.isiknowledge.com/). 

The article by Kirkpatrick and Slatkin marked a philosoph- 
ical shift from the idea of a tree shape statistic as a purely 
descriptive device to that of a mapping which can be used in 
a statistical fashion. Their work was continued more recently 



by Agapow and Purvis [4] who took seven tree shape statistics 
from the literature and one of their own, then tested them for 
power in distinguishing several different models. They then 
made general recommendations for which statistics to use. 

The next step for the Kirkpattick-Slatkin methodology 
needs to overcome two limitations. First, the statistics which 
are tested are typically invented "by hand" and so are lim- 
ited by the ingenuity of individual authors. Second, general 
recommendations may not be sufficient for all situations in 
which tree shape statistics are useful. For example, although 
a statistic such as Colless' index [5] has lots of power in the 
Kirkpatrick-Slatkin and Agapow-Purvis scenarios, it has low 
power to distinguish between two distributions which have 
similar overall balance [6]. 

This paper presents a methodology which enables, for the 
first time, direct optimization over tree shape statistics. First, 
we present a recursive framework and a class of algebraic 
expressions which can be used to define tree shape statistics 
in a natural way. These statistics are a large and varied family 
which include most of the present-day free shape statistics. 
Second, this paper presents a practical genetic algorithm 
which, given an objective function, can be applied to produce 
high-performance tree shape statistics. 

For the purpose of this paper, tree refers to a finite rooted 
bifurcating tree without leaf labels or edge lengths. 

I. Binary recursive tree shape statistics 

A. Definition and examples 

This section defines binary recursive tree shape statistics 
(BRTSS) which form the framework over which the opti- 
mization algorithms operate. The starting observation for the 
definition is that many extant statistics are constructed with 
reference to their values on subtrees. For example, the number 
of leaves of a tree can be calculated recursively by summing 
the number of leaves of its two subtrees. Using X^Y to 
signify a tree with X and Y as subtrees, one can write this 
statement as 

l(X^Y) = l{X) + l(Y). 
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One can write a tree shape statistic of this sort by specifying 
a "recursion" p and a "base case" A, 



s(T) 



p(s(X),s(Y)) if T = X~Y 
A if T is a leaf 



(1) 



Because X " Y = Y " X, the resulting s is well defined only if 
p is symmetric, i.e. if p(x, y) = p(y, x). In the above notation 
the number of leaves of a tree can be written as a recursive 
tree shape statistic with A = 1 and p(x, y) = x + y. Another 
example of this sort of statistic is the maximal depth of the 
tree, for which A = and p(x, y) = 1 + max(i, y). 

A remarkable number of useful statistics can be achieved 
by varying p and A. However, considerably more can be writ- 
ten using several mutually recursive statistics. For example, 
perhaps the most popular tree shape statistic is the "Colless 
index" I c [2] [5]. This index (without a normalizing factor) 
sums for each internal node the absolute value of the difference 
between the number of leaves of the two daughter subtrees of 
that internal node. It can be written as follows: 



Io(T) 



I c {X) + I c (Y) + \l(X)-l(Y)\ if T = X~Y 




if T is a leaf. 



This version of I c is constructed from two real numbers (the 
base cases) and two functions symmetric in X and Y (the 
recursions). This leads to the definition of a BRTSS. 

Definition 1: A BRTSS of length n is an ordered pair (A, p) 
where A G R" and p is an n-vector of Symm 2 (R") — > M 
maps. 

In the definition, Symm 2 (R™) denotes the symmetric product 
of R" with itself. The condition that the pi map from the 
symmetric product is equivalent to saying that they map 
R 2 ™ — > R and are invariant under the action exchanging the 
Xi's and the t/j's, i.e. for any j 

Pj (x! , . . . , x n , y 1} . . . , y n ) = pj (yt, . ■ ■ , y n , m, ■ ■ ■ , x n )- 

(2) 

A BRTSS is evaluated on a tree by a generalization of Q. 
Recursively define the Si by 



Si{T) 



Pl ( Sl (X),..., Sn (X), Sl (Y),..., Sn (Y)) 

A,; 



(3) 



where the first case is used in case T = X ~ Y , and the second 
if T is a leaf. The final value of the BRTSS on T is simply 
defined to be si(T). The symmetry property of the pi imply 
that Q is well defined. For this paper, x% (resp. j/j) will be 
used for the value of Sj on the subtree X (resp. Y). 

The Colless index (without the normalizing factor) can now 
be written as a BRTSS of length 2 with the base cases Ai = 0, 
A2 = 1, and the two recursions 



Pi(xi,x 2 ,yi,y 2 ) 
P2(xi,x 2 ,yi,y2) 



xi+yi + \x 2 - y 2 \ 

%2 + Vl- 



The second recursion p 2 for I c simply totals the value of s 2 
applied to subtrees X and Y. With the base case A2 = 1, this 
implies that s 2 gives the number of leaves of the tree as before. 
The first recursion p\ adds the absolute value of the difference 
of s 2 applied to the subtrees to the sum of the values of S\ 
applied to the subtrees. This is indeed the (un-normalized) 
Colless index I c as described above. 



The BRTSS formulation can be used to define many tree 
shape statistics from the literature with simple recursive func- 
tions p. For example, we show here how to define the number 
of two leaf subtrees of a tree (called the number of "cherries" 
[7]), and un-normalized versions of Sackin's N [1], and Shao 
and Skokal's B\ [8]. The latter two can be defined as follows. 
Let X denote the internal nodes and r denote the root of a 
tree. For i 6 I, let Ni be the number of leaves of the subtree 
subtended by i. For a node j 6 X— {r}, let Mj be the maximal 
depth of the subtree with j as the root. Then 



m; 



(4) 



The above formulas for the number of cherries Ch, N, and 
Bi, respectively, can be written in BRTSS form as 1 

Ch:((0,l),(x 1 +y 1 +I(x 2 +y 2 ,2), x 2 + y 2 )) 
N :((0,l),(n +yi + x 2 + y 2 , x 2 + y 2 )) 

1-1(32,0) 



B x : (0,0), xi +yi + 



x 2 

1-/(2/2,0) 

V2 



1 + max(xi, yi) 



In the above / denotes the binary indicator function, i.e. 
J(a, b) is one if a = b and zero otherwise. We note that 
formulae for I c and N similar to the above have been 
published previously in [9]. 

The emphasis in this paper will be on BRTSS with reason- 
ably simple p, however, it is true that any mapping of trees 
to the real line can be written as a BRTSS using sufficiently 
complex p. Begin by enumerating the (countable) set of trees 
and define s 2 (T) to be the number of a given tree T. This can 
be written recursively by setting p 2 (xi,yi,x 2 ,y 2 ) to be the 
number of the tree which has the trees numbered x 2 and y 2 
as subtrees. The function pi simply gives the desired value of 
the statistic on the tree composed of the two subtrees numbered 
x 2 and y 2 . This statistic is a BRTSS by definition. 

B. Verifiably symmetric algebraic expressions 

The primary aim of this paper is to demonstrate a system 
capable of optimizing over a class of tree shape statistics. 
The previous section defined the BRTSS class, which defines 
a tree shape statistic in a natural way given a real vector 
A G M™ and p, an n- vector of Symm 2 (R") — > R maps. The 
promised optimization will proceed by modifying the A and 
p vectors. Optimizing over n-dimensional real vectors is a 
classical subject, however optimization over such symmetric 
maps generally is not. Any class of such Symm 2 (R") — > R 
maps could in principle be used as a set for enumeration and 
optimization, however a balance must be struck between ease 
of optimization and generality. For instance, one could easily 
use symmetric linear functions as the underlying recursions 

1 For BRTSS evaluation we will use the convention that 0/0 = 0. This allows 
for more flexibility in the use of division. 
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and adjust the coefficients in order to find statistics with de- 
sirable properties. However, this rather restrictive class would 
exclude all of the above BRTSS except for I and N. 

The purpose of this section is to introduce a subset of 
the Symm 2 (R") — > M functions which is quite general 
though sufficiently simple to be the underlying population for 
a genetic algorithm. This subset is functions induced by a 
class of algebraic expressions with certain allowed operations 
and operands. The challenge lies in ensuring the symmetry 
property (0. 

The basic idea of this class of algebraic expressions, which 
will be called verifiably symmetric algebraic expressions, is 
simple: we constrain the algebraic expressions to remain the 
same (up to the order of operands of commutative operations) 
after exchanging the x% for the j/j. For example, exchanging x\ 
for 2/1 in x% + y\ gives y\ + xi, which is equivalent to x\ + y\ 
after applying the commutative rule for addition. Therefore, 
x\ + yi is considered to be verifiably symmetric. Similarly, 
min(a;i/j/2, yi/x?) is also verifiably symmetric because min 
is a commutative binary operation. On the other hand, the 
algebraic expression * x\ is not verifiably symmetric even 
though it induces a symmetric function of Xi and y±. The 
verifiably symmetric criterion clearly implies that the induced 
functions carry the symmetry property (0. 

The set of finite verifiably symmetric algebraic expressions 
is a convenient set over which optimization is possible. One 
could use a larger set of expressions with a more complex 
notion of symmetry, however, this might require consideration 
of the full problem of simplification of algebraic expressions. 
The simplification of algebraic expressions is a subtle field in 
itself [10] [11] and thus generalizing the definitions might not 
aid our purpose of finding a simple and useful framework for 
tree shape statistics. 

We now present a more rigorous version of the above 
definition. 

Definition 2: An algebraic expression on sets C of con- 
stants, V of variables, U of unary operations, and B of binary 
operations is one of the following: 

« a constant from C 

• the instantiation of a variable from V 

• a unary operation from U applied to an algebraic expr. 

• a binary operation from B applied to an ordered pair of 
algebraic expressions. 

A variable is different than its instantiation, as one variable 
may have many distinct instantiations. Equality of algebraic 
expressions is defined recursively in the natural way. 

Note that the standard rules of simplification and equiva- 
lence are not automatic. All binary operations have an order 
(thus x + y is not equal to y + x), there is no notion of 
associativity, and no simplification is done at this stage. 

To construct algebraic expressions for the p of the BRTSS, 
this paper uses the integers as the set of constants and 
xi, . . . , x n , yi, . . . , y n as the set of variables, where n is 
the length of the BRTSS. The standard binary operations 
+, — , * and / will be used, as well as the binary indicator 
function /, exponentiation, and max. Of these, +,*,/, and 
max are considered to be commutative. The unary operations 
used are inverse, negation, absolute value, exp, (natural) log, 



and the symmetrization of any commutative binary operation, 
which is described below. In the following the term "algebraic 
expression" will be used without qualification as C, V, U, and 
B have now been fixed. Although the definitions below do not 
depend on these choices, the examples will. 

Definition 3: Two algebraic expressions R and S are com- 
mutatively equivalent, denoted R = S, if R can be obtained 
from S by changing the order of operands in commutative 
binary operations. 

Denote by <fi the map exchanging xi for yi in the expres- 
sions. Recall that this is the map used to define the symmetry 
property (0 of the p in the definition of the BRTSS. 

Definition 4: An algebraic expression E is verifiably sym- 
metric if E = <j>(E). 

Given the choice of operations, examples of verifiably sym- 
metric algebraic expressions can be found in the above defini- 
tions of Ch, N, and B\. However, the expression \xi— y^\ in I c 
is not verifiably symmetric using our choice of operations even 
though the usual algebraic simplification leads to equivalence 
between |x2 — an d its image under <f>. Of course, if we 
had decided to include the absolute value of a difference as a 
commutative binary operation in the set B then |x2 — 2/2 would 
be considered verifiably symmetric. Nevertheless, \x2~ y<i\ can 
be written max(x2~ yi, 2/2—^2) which is verifiably symmetric. 
Therefore, I c can indeed be written as a BRTSS with verifiably 
symmetric recursions. 

Because of the strict hierarchy of containment set up by 
the definition of algebraic expressions, the notions of sub- 
expression and "smallest expression" containing a subexpres- 
sion are well defined. 

Definition 5: The minimal fixed expression M(z; E) of the 
instantiation z of a variable appearing in a verifiably symmet- 
ric algebraic expression E is the smallest sub-expression of E 
containing z which is verifiably symmetric. 
For example, M(xi; 2 * log(min(xi, yi)) is min(xi,yi). 

A minimal fixed expression clearly cannot be a constant. 
Because it is verifiably symmetric, it cannot be a variable 
instantiation. By minimality, it cannot be a unary operation 
applied to a subexpression. Therefore it must be of the form 
E z -k F, where the variable instantiation z is contained in E z 
and * is a binary operation. 

Furthermore, since (f>(E z -kF) = E Z *F, either <j>{E z ) = E z 
or 4>{E Z ) = F. The first option is not possible: otherwise 
E z -k F would not be minimal. Therefore the minimal fixed 
expression M(z; E) of any instantiation z is of the form 
E z -k F where z is contained in E z and <f>(E z ) = F. 
This implies further that * is commutative. Because of the 
symmetry, it is possible to only store one "side" of the minimal 
fixed expression, the other side being available through <j>. 
In the following terminology, any minimal fixed expression 
is commutatively equivalent to an expression written with a 
symme triza tion: 

Definition 6: The symmetrization S+(E) of an expression 
E with respect to a commutative binary operation * is E * 
4>{E). 

For example, x\ +yi can be written 5+(xi), and max(si — 
2/i, j/i — xi) can be written S , max (xi — yi). The symmetrization 
of a binary operation is a unary operation. If every variable 
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instantiation in an expression is contained within at least 
one symmetrization, then we will say that the expression is 
completely symmetrized. For example, 5* (^2) is completely 
symmetrized, while max(xi, 5* (0:2)) is not. 

Every variable instantiation in a verifiably symmetric al- 
gebraic expression is included in a minimal fixed expression 
by definition, and each such minimal expression can be 
written with a symmetrization up to commutative equivalence. 
Therefore 

Proposition 1: Any verifiably symmetric expression is 
commutatively equivalent to a completely symmetrized alge- 
braic expression. 

This simple proposition allows for a compact "grammar" of 
verifiably symmetric algebraic expressions and a trivial way 
for optimization algorithms to modify algebraic expressions 
while staying within the verifiably symmetric class. The rest 
of this paper will consider completely symmetrized algebraic 
expressions as the expressions defining the pi. 

The value of BRTSS can be computed by free software from 
http://math.canterbury.ac.nz/matsen/simmons/ 

II. Enumeration and Optimization 
A. Enumeration 

With this framework it is possible to enumerate many alge- 
braic expressions and test them for desirable properties. This 
idea was implemented as follows. First define the "size" of an 
algebraic expression to mean the total number of operations 
and operands of the expression: for example, the expression 
2 + Xi has size 3. The symmetrization of a commutative 
binary operation is unary and thus adds only one to the size. 
Second, select a set of constants, variables, unary operations 
and binary operations for enumeration. These can be subsets 
of the complete set allowed for BRTSS recursions. 

For each k up to a maximal size, two lists are constructed: 
one of completely symmetrized algebraic expressions and 
another of non-symmetrized expressions. To construct the 
completely symmetrized expressions of size k, all unary oper- 
ations are applied to the completely symmetrized expressions 
of size k — 1, then all symmetrizations are applied to all non- 
symmetrized expressions of size k — 1, then all binary oper- 
ations are applied to all pairs of completely symmetrized ex- 
pressions of total size k — 1. To construct the non-symmetrized 
expressions of size k, all unary operations are applied to the 
non-symmetrized expressions of size k—1, then all binary 
operations are applied to all pairs of completely symmetrized 
and non-symmetrized expressions of total size k — 1, then all 
binary operations are applied to all pairs of non-symmetrized 
expressions of total size k — 1. 

For k = 1, the completely symmetrized algebraic expres- 
sions are taken to be the chosen set of constants, and the 
non-symmetrized algebraic expressions are instantiations of 
the variables. In the present application some limited forms of 
simplification were implemented to eliminate double negation 
and similar obvious redundancies. 

The number of statistics constructible using direct enu- 
meration is large. We enumerated all statistics of length 
one, size less than or equal to seven, constants taken from 



the set {0,1,2}, variables x and y, and operations as in 
Section H-Bl except for subtraction and division, which can be 
expressed using combinations of operations. After removing 
those statistics which are constant on all trees on eight leaves, 
516,699 statistics remained. The number of analogous BRTSS 
with length larger than one is considerably larger. 

B. Genetic Algorithm 

Genetic algorithms typically optimize over a very large 
discrete space by maintaining a population of elements of 
that space and allowing reproduction based on the value of 
the function to be optimized [12]. Some notion of mutation 
and crossover are defined such that the population changes 
over time. Here the underlying space is taken to be the set of 
BRTSS with integral A^ and algebraic expressions as in II-BI 
for the pi. We will assume for this section that the BRTSS 
under consideration have length n. 

Standard Wright-Fisher sampling [13] was applied for re- 
production. When the objective was to maximize a positive 
number, the raw fitness function was simply that number. 
When the objective was to minimize a number between zero 
and one, such as a p-value, the negative of the logarithm of 
the objective function was used as the raw fitness. 

Two types of mutation were defined: mutation of A and 
mutation of p. A mutation of A simply chooses an i £ 
{1, . . . , n} uniformly and then adds or subtracts one from A^. 
A mutation of p also chooses a pi uniformly to mutate. A 
mutation of a pi can be either an insertion, modification, or 
a deletion. An insertion can occur to the whole expression 
or to any sub-expression, and involves replacing / by either 
u(f) for some unary operation u or by replacing / with f *t, 
where Hsa constant or a variable instantiation and * is some 
binary operation. A modification uniformly selects a random 
operation or operand from the expression and modifies it in 
place. Binary (resp. unary) operations can be modified to be 
any other binary (resp. unary) operation. Constants increase 
or decrease by one. Variables either change from an Xi to yi 
(or vice versa) or the index is increased by 1, wrapping back 
to 1 when appropriate. Deletion can act on a binary operation 
or a unary operation. A unary operation u(f) is replaced by 
/, and a binary operation / * g is replaced by a uniform 
selection of / or g. The distributions on the above choices 
can be chosen arbitrarily, however for the present applications 
the distributions were all taken to be uniform. 

Some of these mutations can transform a completely sym- 
metrized algebraic expression to one which is not. In this 
case, first all the locations for symmetrized operations which 
would symmetrize a subexpression are found. Then one is 
uniformly chosen among these locations and a uniformly 
chosen symmetrized operation is applied. If the resulting 
expression is still not completely symmetrized the process is 
repeated until it is. 

Crossover was defined analogous to chromosome sort- 
ing in diploid organisms. Given two BRTSS, one called 
"heads" and the other "tails", sample a Bernoulli ran- 
dom variable for each i and choose the corresponding pi 
and Xi for the first product of the crossover. The other 
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product is obtained by using the compliment. For exam- 
ple, if the sample is HT for ((Ah,i, Ah, 2 ), (ph,i, Ph,%)) 
crossed with ({Xt.i, ^t 2 ), {pt.i, Pra))^ me resulting BRTSS 
are ((Ah,i, At 2 ), (ph,i,Pt,2)) and ((At,i,Ah 2 ), (fir,i, Ph,i))- 
In order to avoid overly long BRTSS, it is possible to 
discount the fitness of a BRTSS according to its size. Specif- 
ically, rather than the raw fitness function F(q) one can use 
F(q) - tpS(q) where 5(g) is the total size of the BRTSS and 
ip is a scaling factor. It is also possible to have ijj change after 
a number of generations of the genetic algorithm. 

C. Workflow 

Here we describe the strategy for producing high- 
performance statistics using the above methodology. First, an 
objective function must be chosen which is representative of 
the problem but which is not too costly to compute. For 
instance, to find a statistic which can differentiate between 
two distributions on trees, a compromise must be found for 
sample size. Too small of a sample may just pick up sampling 
differences, yet too large of a sample significantly slows down 
computation. 

Second, enumeration is used to find a good initial population 
for the genetic algorithm. Early efforts demonstrated that the 
genetic algorithm was excellent at finding local optima, but 
that it had difficulty traversing the whole fitness landscape. 
A solution is to start with a diverse population of statistics, 
which can be found using the method of Section ITl-AI Many 
statistics are enumerated and then sorted by their performance; 
a selection of the best is then used as an initial population. 

Third, the genetic algorithm is run with a variety of pa- 
rameters and random seeds. The resulting statistics are then 
collected and rated against one another and the best ones 
found. 

The algorithm has been implemented in an ocaml 
[14] program; complete source code is available at 
http : //math . c anterbury . ac . nz/matsen/ 

D. Overfitting 

The number of verifiably symmetric algebraic expressions — 
even of moderate size and with a small selection of constants — 
is enormous. The number of binary recursive tree shape 
statistics constructible with these algebraic expressions is of 
course significantly larger. For this reason some caution is 
needed to avoid "overfitting" the statistical problem at hand. 
For example, the method described here can quite easily find 
a statistic which seems to indicate a significant difference be- 
tween two moderately-sized draws from the same distribution 
on trees. 

This problem can be approached in the following ways. 
First, in the applications, we have split the data into "training" 
and "testing" data, such that statistics are evolved on the train- 
ing data, and then their significance is indicated on the testing 
data. If the testing data is of reasonable size, it is unlikely that 
observed statistical significance is due to sampling. Second, 
one can reduce the overfitting problem by incorporating size 
into the fitness function as described above. This tends to keep 
the statistics in a more manageable range. Finally, statistics 



with only one recursion are less likely to overfit than those 
with multiple recursions; for this reason we have restricted 
ourselves to the single-recursion case in the below application. 

III. Application 

In this section we apply the methods described in the 
previous chapter to perform a re-analysis of the results from 
a recent paper by Blum and Francois [15]. The main purpose 
of their paper was to investigate an earlier suggestion by 
David Aldous that an instance of his "beta-splitting" model 
might approximate the distribution of macroevolutionary phy- 
logenetic trees reconstructed from sequence data [16]. Blum 
and Francois confirm his suggestion, "that the [imbalance 
measures] generally agree with a very simple probabilistic 
model: Aldous' Branching." These models are explained be- 
low. The conclusion of the example application in this paper 
will be that although the sampled trees do fit the "Aldous' 
Branching" model reasonably well in terms of overall balance, 
it is possible to find a tree shape statistic which demonstrates 
a substantial deviation from the Aldous model. 

The "Aldous' Branching" model is an instance of a one- 
parameter family of models invented by David Aldous called 
the "beta-splitting" models. These models are simply proba- 
bility distributions on trees and are not intended to model any 
evolutionary process. The idea of the beta-splitting model is 
to recursively split the taxa into subclades using a distribution 
derived from the beta distribution. More precisely, assuming 
that a clade has n taxa, the probability of the split being 
between subclades of size i and n — i is 



<ln,p{i) = C{n;(3) 



r(/3 + i+l)r(/3 + n-i+l) 



r(* + l)r(n - i + 1) 

where C(n;/3) is a normalizing constant. The parameter (3 
in Aldous's model thus determines the overall balance of the 
trees, such that larger values of (3 lead to increased balance. 
The so-called "equal rates Markov" (ERM) model corresponds 
to (3 = 0, and the "proportional to different arrangements" 
(PDA) model results when (3 is set to —1.5. The model when 
j3 is set to —1 is called the "Aldous' branching" model by 
Blum and Francois, but we will simply call it the (3 = — 1 
model. 

Blum and Francois took a sample of trees from the tree 
database TreeBASE [17] and found a maximum-likelihood 
estimate of (3 for each of these trees. Because not all of 
the trees are binary, they resolved multifurcating nodes (also 
called polytomies) by splitting them either via the ERM model 
("ERM-solved" trees) or via the PDA model ("PDA-solved" 
trees). They felt that the inclusion of outgroups might skew 
the analysis, and thus passed the trees through an "automated 
outgroup removal procedure" which simply removes leaves or 
cherries (subtrees with two leaves) branching off of the root. 

The general strategy taken in this section will be to compare 
the same trees used by Blum and Francois to a sample from the 
j3 = — 1 (a.k.a. "Aldous' branching") distribution. Specifically, 
for each ERM-solved TreeBASE tree in their set after the 
outgroup removal procedure, we sample a tree of the same 
size from the (3 = — 1 model. This provides a paired data 
set which is appropriate for paired statistical tests such as the 
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sign test. As described in Section III-DI we divide the data 
into training and testing subsets. In this case the trees were 
numbered starting from zero and the even numbered trees 
taken for training and the odd numbered trees taken for testing, 
resulting in 1032 trees for the training set and 1031 trees for 
the testing set. 

We first review the statistic used by Blum and Francois to 
compare the TreeBASE trees and the corresponding model 
trees. They define 

where as before Ni is the number of leaves of the subtree 
subtended by internal node i. We applied this statistic to the 
paired data set, which led to a p-value of 0.362 with the sign 
test. Therefore through the eyes of the Blum and Francois 
s statistic, the (3 = — 1 model indeed does a good job of 
producing trees similar to those found in TreeBASE. 

The goal for the rest of this section will be to find a statistic 
which does indicate a significant statistical difference between 
the j3 = — 1 trees and the TreeBASE trees. Accordingly, the 
objective function applied to a chosen statistic was chosen to 
be the negative of the logarithm of the p-value of the sign 
test of the statistic applied to the aligned data. The recipe 
from Section III-CI was followed. In the enumeration phase, 
all statistics of length one and size up to five, with constants 
and Xi chosen from the set {0, 1, 2}, were tested and the best 
used as an initial population. The genetic algorithm was run 
with population sizes of 50 and 100, mutation rate of 20% per 
generation, and 1500 generations. 

We will focus on one statistic returned from the algorithm, 
which will be called ib. The ip statistic has A = 8 and 
p(x,y) — (log(x + y)) 5 . This statistic rejects the (3 = — 1 
model with a p-value of 6.78 x 10 -19 for the paired sign test 
on the testing data. Therefore this statistic clearly indicates 
an important difference between the beta-splitting and the 
reconstructed trees. 

Although the ip statistic was developed in order to differ- 
entiate between the TreeBASE trees and the f3 = — 1 trees, 
it does a good job of differentiating between the sample of 
TreeBASE trees and samples from the beta-splitting model 
for a range of (3 values. As seen in Figure the ip statistic 
rejects the beta-splitting model for a variety of values of j3 
with a very low p-value, while the s statistic only rejects the 
beta-splitting model when (3 is rather far away from —1. 

We will now sketch some ideas of how the ip statistic 
might "work." An interesting feature of this statistic is that 
it converges on sequences of trees of increasing size satisfy- 
ing certain conditions. For example, its value on an infinite 
balanced tree is approximately 5.05 x 10 5 , and on an infinite 
"comb" (perfectly imbalanced) tree its value is approximately 
3.32 x 10 5 . The reasons for this are clear from Figure [2] 
The statistic can be evaluated on a large balanced tree by 
recursively iterating the function x i— » (\og(2x))° ; from the 
plot it is clear that this recursion will converge to a value 
slightly more than 5 x 10 5 . For the comb tree the recursion is 
x i— ► (log(x + 8)) 5 , and the convergence value can again be 
estimated from the plot. 



Rejection levels for the beta-splitting models 
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Fig. 1. A comparison of the if) statistic considered in this paper and the s 
statistic of Blum and Francois. The x axis is the value of /3 used to sample 
trees, and the y axis is the negative base 10 logarithm of the p-value of the 
sign test applied to the sample from TreeBASE and samples of the beta- 
splitting model. Clearly the s statistic has low power to distinguish between 
the two samples for a range of j3, while the ip statistic has high power within 
this range. 
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Fig. 2. The recursion from ip. 



However, as can be seen from Table|l] the convergence is not 
immediate. Furthermore, for small trees the statistic increases 
with imbalance (compare the balanced tree of depth three to 
the comb of depth seven, each of which have eight leaves), 
whereas on large trees the statistic increases with balance. This 
implies that the exchange of a comb subtree for a balanced 
subtree in a small tree can increase the statistic, whereas the 
same exchange in a large tree may decrease the statistic. This 
feature suggests that the TreeBASE trees may deviate from 
the "Markov branching" property of the beta-splitting models, 
which is that the distribution on subtrees of a given size is 
independent of the rest of the tree. 

At this point it is important to emphasize that the ip statistic 
was invented for the single purpose of distinguishing the beta- 
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TABLE I 
Calculation of the tp statistic 



depth 
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comb 
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164 


94.7 
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6.52e+03 


2.04e+03 
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7.64e+04 


2.58e+04 
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2.42e+()5 


1.08e+05 
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3.85e+05 


2.09e+05 
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4.57e+()5 


2.76e+05 


7 


4.87e+()5 


3.09e+05 



splitting trees from the sorts of trees one finds in TreeBASE. 
This statistic was named for convenience only, not to introduce 
it into the canon of tree shape statistics. Indeed, one intent of 
this paper is to reduce the traditional emphasis on individual 
"general purpose" tree shape statistics and to focus instead on 
creating statistics for a specific application. 

There will often be many such useful statistics. For example, 
note that a number of statistics appeared on different runs of 
the same objective function with similar performance. Because 
the space of algebraic expressions is so large and the fitness 
landscape is very "peaked," runs of the genetic algorithm 
seldom converge on the same BRTSS when started with a 
different random seed or slightly different parameters. 

The results of this section should not be construed as a rejec- 
tion of the results or methodology of Blum and Francois. They 
found a value for (3 which does in fact generate the observed 
level of overall balance for the TreeBASE trees. However, the 
above statistic shows that in this case there is more to tree 
shape than just overall balance. The difference between the two 
perspectives indicates interesting future directions for research. 
For example, is the observed difference due to reconstruction 
bias, or is the deviation indicated by the above statistic an 
actual feature of macroevolutionary processes? If the latter, 
how can we modify the present models to accommodate the 
difference? 

IV. Conclusions 

In conclusion, we have developed a framework which allows 
enumeration of and optimization over a class of tree shape 
statistics. This class includes many of the tree shape statistics 
found in the literature. A genetic algorithm can be applied in 
this framework to find customized tree shape statistics for a 
certain application. The methodology is applied in an example 
case, finding a statistic which indicates a significant difference 
between two distributions on trees which was not previously 
evident. 

Along with this new tool comes a new problem, which 
is that an automated system such as the genetic algorithm 
described above can create very complex tree shape statistics 
whose values can be hard to interpret. This issue is not 
problematic from an abstract statistical viewpoint, however it 
is comforting to have an intuitive interpretation of the statistics. 
In the sample case above some intuition was developed about 
a relatively simple statistic, but it may not be easy to find an 



interpretation for a complex one. It would be helpful in this 
regard to be able to derive limiting distributions for BRTSS 
applied to a distribution on trees. It is possible to do this for 
certain statistics, such as the number of cherries [7] or I c and 
N [9]. The methods used in the latter paper are applicable 
to a subclass of the BRTSS, however a substantial amount of 
work must be done on a case-by-case basis. 

We note that the problem of differentiating two distributions 
on combinatorial objects has been approached in a different 
fashion by the statistical physics community. In their case, 
many models have been proposed for the growth of social 
and biological networks and a goal is to confirm or reject a 
certain model given some data. Analogous to the classical tree 
shape statistics, individual means of comparing graphs, such 
as the diameter or the number of subgraphs of a specific type, 
have been described (see, e.g. [18]). A more recent approach 
is to count in some manner the number of many different 
walks on the networks and then feed that information into a 
Support Vector Machine [19] [20]. This approach is similar to 
that described in the present paper in that machine learning is 
used to come up with tests which can distinguish models from 
data, however the actual technique is quite different. Their 
network approach focuses on local structure, while the BRTSS 
in this paper often provide global information. Nevertheless, 
an application of the network approach might provide some 
insights in the tree shape setting. 

In the future we hope to use the methodology presented 
in this paper to expand the applications of tree shape theory 
in useful directions. For example, moderately sophisticated 
models of influenza evolution are currently being used to elu- 
cidate the evolutionary processes which form the remarkable 
imbalance of influenza phylogenetic trees [21] [22]. At this 
point very little of even the classical tree shape statistics are 
being applied for quantitative description. Another potentially 
underdeveloped area is the use of tree shape statistics to detect 
bias in modern tree reconstruction methods on real data; a 
lone article from over 10 years ago [23] forms the complete 
bibliography in this area. 
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